Skip to content

API: make min/max on empty datetime df consistent with datetime serie… #33911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

CloseChoice
Copy link
Member

@CloseChoice CloseChoice commented May 1, 2020

…s (#33704)

Fixes the following issue:

import pandas as pd
df = pd.DataFrame(dict(x=pd.to_datetime([])))
df.max()

throws a ValueError but

pd.Series(pd.to_datetime([])).max()

results in NaT.
With this change, calling an DataFrame on en empty pd.to_datetime([]), results in

In[1]: df = pd.DataFrame(dict(x=pd.to_datetime([])))
In[2]: df.max()
Out[5]: 
x   NaT
dtype: datetime64[ns]

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note as well, 1.1 bug fixes datetimelike section

@@ -384,8 +384,7 @@ def _na_for_min_count(
else:
assert axis is not None # assertion to make mypy happy
result_shape = values.shape[:axis] + values.shape[axis + 1 :]
result = np.empty(result_shape, dtype=values.dtype)
result.fill(fill_value)
result = np.full(result_shape, fill_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not have a dtype parameter? does it matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result parameter here is what is making all the trouble. We set the dtype to datetime but our NaT is treated as NaN which numpy internally tries to cast into an integer which yields the error:
ValueError: cannot convert float NaN to integer.
The conversion happens somewhere inside the c-part of numpy, which I'm not familiar with.

But to come back to your question: for our result, specifying the dtype is not necessary. The given outputs are all corresponding to the type of the input. But good point, I should probably add this check to my tests.

def test_sum_empty_df_series():
# Calling the following defined sum function returned an error for dataframes but
# returned NaT for series. # Check that the API is consistent in this sense when
# operating on empty Series/DataFrames. See GH:33704 for more information
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frame tests go here (you can rename slightly); these should eventually be moved into test_reductions.py

class TestDataFrameReductions:
def test_min_max_dt64_with_NaT

series are here:
tests/series/test_reductions.py

in tests/frame/test_analytics.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# returned NaT for series. # Check that the API is consistent in this sense when
# operating on empty Series/DataFrames. See GH:33704 for more information
df = pd.DataFrame(dict(x=pd.to_datetime([])))
series = pd.Series(pd.to_datetime([]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for frame also need to test both axis=

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels May 1, 2020
@pep8speaks
Copy link

pep8speaks commented May 1, 2020

Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-07 22:16:22 UTC

assert (df.max(axis=0).x is NaT) == (expected_dt_series.max() is NaT)

# check axis 1
tm.assert_series_equal(df.min(axis=1), expected_dt_series)
Copy link
Member Author

@CloseChoice CloseChoice May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to mention that calling min on axis=1 of an empty dataframe does not return the same as calling it on a Series, since this it currently not possible (and I don't think it is desired) with the implementation of reduce-operations. This test checks that when reducing over the index a Dataframe behaves like a series.

@CloseChoice
Copy link
Member Author

pls add a whatsnew note as well, 1.1 bug fixes datetimelike section

done

@CloseChoice CloseChoice requested a review from jreback May 6, 2020 19:40
@@ -557,6 +557,7 @@ Datetimelike
- Bug in :meth:`DatetimeIndex.intersection` losing ``freq`` and timezone in some cases (:issue:`33604`)
- Bug in :class:`DatetimeIndex` addition and subtraction with some types of :class:`DateOffset` objects incorrectly retaining an invalid ``freq`` attribute (:issue:`33779`)
- Bug in :class:`DatetimeIndex` where setting the ``freq`` attribute on an index could silently change the ``freq`` attribute on another index viewing the same data (:issue:`33552`)
- Bug in :meth:`nanops._na_for_min_count` when called with empty :class:`DataFrame` of ``timedelta64`` dtype (:issue:`33911`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a userfacing note. IOW a user wants to know, what changed. you are referring to an internal routine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# calling np.full with dtype parameter throws an ValueError when called
# with np.datetime64 and pd.NaT
try:
result = np.full(result_shape, fill_value, dtype=values.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what hits each of these branches?

Copy link
Member Author

@CloseChoice CloseChoice May 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when np.full is called with the parameter dtype=np.datetime64 and fill_value=pd.NaT a ValueError ValueError: cannot convert float NaN to integer. In this case I don't call np.full without an explicit dtype. Giving the array result an explicit dtype does not have an effect on the DataFrame column but I thought it is cleaner if I at least try to give it the correct one. I hoped the comment explains this. But if not, I'm gonna make it gonna improve the comment.

@@ -1258,3 +1258,26 @@ def test_min_max_dt64_with_NaT(self):
res = df.max()
exp = pd.Series([pd.NaT], index=["foo"])
tm.assert_series_equal(res, exp)

# Calling the following sum functions returned an error for dataframes but
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a new test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@CloseChoice CloseChoice requested a review from jreback May 7, 2020 23:09
@jreback jreback added this to the 1.1 milestone May 9, 2020
@jreback jreback merged commit ddbeca6 into pandas-dev:master May 9, 2020
@jreback
Copy link
Contributor

jreback commented May 9, 2020

thanks @CloseChoice

@CloseChoice CloseChoice deleted the API-consistent-empty-datetime-max-or-min branch May 9, 2020 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: min/max of empty datetime dataframe raises
3 participants